According to the model selection the best model structure is: y ~ x1 + x2 + x4 + error. Now we’re going to test the model further.

pacman::p_load(ggplot2, ggthemes, tidyr, patchwork, plotly)

Fit the model

data = read.csv("data/x_y.csv", header = F)
colnames(data) = c("x", "y")

Construct X matrix

X = cbind(data$x,
          data$x^2,
          data$x^4
          ) 

Fit model, generate predictions, compute error (SSE)

theta = solve(crossprod(X), crossprod(X, data$y))

y_pred = X %*% theta

residuals = (data$y - y_pred)
SSE = sum(residuals^2)

sigma_sq = SSE/(nrow(X) - 1)

Residual analysis

hist(residuals)

qqnorm(residuals)

Parameter covariance matrix

cov = sigma_sq * (solve(t(X) %*% X))
print(cov)
##               [,1]          [,2]          [,3]
## [1,]  3.841759e-05  4.010982e-06 -1.065549e-06
## [2,]  4.010982e-06  3.450809e-05 -4.194641e-06
## [3,] -1.065549e-06 -4.194641e-06  7.424601e-07

Parameter uncertainty pdf

Because we have 3 parameters, we’ll need to plot all their combinations resulting in 3 plots.

First we create grid of possible parameter values for which we estimate the uncertainty.

n_points = 50
range = 0.01
x1_grid = seq(theta[1]-range, theta[1]+range, length = n_points)
x2_grid = seq(theta[2]-range, theta[2]+range, length = n_points)
x4_grid = seq(theta[3]-range, theta[3]+range, length = n_points)

t = rbind(x1_grid[50], x2_grid[50], x4_grid[50])

Now we can calculate the pdf

cov_inv = (t(X) %*% X) * (1/sigma_sq)
cov_det = det(cov)

n_params = 3

Parameters x1 and x2

Parameters x1 and x4

Parameters x2 and x4

Model’s prediction

Now we’ll generate predictions on training data and 95% confidence intervals

Plot the predictions + CI

Model validation

Validate the model on new train-test split - 70:30

Construct X matrices (train and test)

Re-use the fitting function from model selection

Fit model on training data and predict testing data

## MSE on training data =  0.00969348 
## MSE on testing  data =  0.01120143

Validate model with k-fold cross-validation